In [8]:
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
import numpy as np
%matplotlib inline
np.set_printoptions(precision=5, suppress=True)
In [9]:
np.random.seed(42)


def init(n, center, cov=None):
    """Draw ``n`` samples from a 2-D Gaussian blob centered at ``center``.

    Parameters
    ----------
    n : int
        Number of samples to draw.
    center : array-like of shape (2,)
        Mean of the distribution (the blob's center).
    cov : array-like of shape (2, 2), optional
        Covariance matrix. Defaults to ``[[4, 1], [1, 4]]``, the slightly
        correlated, equal-variance shape used throughout this notebook.

    Returns
    -------
    numpy.ndarray of shape (n, 2)
        The sampled points.
    """
    if cov is None:
        cov = [[4, 1], [1, 4]]
    return np.random.multivariate_normal(center, cov, size=n)
# Build the dataset: four Gaussian blobs of 200 points each, stacked
# into a single (800, 2) array.
blob_centers = [[9, 0], [9, 9], [0, 9], [0, 0]]
data = np.concatenate([init(200, center) for center in blob_centers])
print(data.shape)

# Quick visual sanity check of the raw, unlabeled points.
plt.scatter(data[:, 0], data[:, 1])
plt.show()
In [10]:
# plot the kmeans results
def kmeans_plot(data, kmeans, h=0.02):
    """Visualize a fitted k-means model on 2-D data.

    Paints the model's decision regions as a colored background, overlays
    the data points, and marks the centroids with white crosses.

    Parameters
    ----------
    data : numpy.ndarray of shape (n_samples, 2)
        The 2-D points that were clustered.
    kmeans : fitted estimator
        Any object exposing ``predict`` and ``cluster_centers_``
        (e.g. a fitted ``sklearn.cluster.KMeans``).
    h : float, optional
        Mesh step size for the background grid. Smaller values give a
        finer (but slower-to-render) decision boundary. Default 0.02.
    """
    # Pad the bounding box by 1 unit so the boundary regions are visible.
    x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
    y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Label every mesh point to paint the decision regions.
    Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired,
               aspect='auto', origin='lower')
    plt.plot(data[:, 0], data[:, 1], 'k.', markersize=2)

    # Draw the centroids as white crosses above everything else.
    centroids = kmeans.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='w', zorder=10)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()
In [11]:
# Fit k-means with the true number of clusters and inspect the result.
kmeans = KMeans(
    n_clusters=4,       # number of clusters to form
    max_iter=100,       # maximum iterations of a single k-means run
    n_init=10,          # runs with different centroid seeds (best is kept)
    init='k-means++',   # strategy for choosing the initial centers
    random_state=42,
).fit(data)
kmeans_plot(data, kmeans)
print('Sum of distances to the centroids: {}'.format(kmeans.inertia_))
In [12]:
# Watch convergence step by step: start all four centroids far away from
# the data (bottom-left corner) and cap the iteration count at 1..10 so
# each plot shows one more step of the algorithm.
# (Swap `init=init_array` for 'k-means++' or 'random' to compare
# initialization strategies.)
init_array = np.asarray([[-10, -10], [-9, -10], [-10, -9], [-9, -9]])
for n_iter in range(1, 11):
    kmeans = KMeans(n_clusters=4, max_iter=n_iter, n_init=1, init=init_array, random_state=42)
    kmeans.fit(data)
    kmeans_plot(data, kmeans)
    print('Sum of distances to the centroids: {}'.format(kmeans.inertia_))
In [15]:
# Elbow method: fit k-means for k = 1..30 and record the inertia
# (within-cluster sum of squared distances) for each k.
d = []
for k in range(1, 31):
    kmeans = KMeans(n_clusters=k, max_iter=300, n_init=20, init='k-means++', random_state=42)
    kmeans.fit(data)
    kmeans_plot(data, kmeans)
    print('Sum of distances to the centers: {}'.format(kmeans.inertia_))
    d.append([k, kmeans.inertia_])
d = np.asarray(d)  # shape (30, 2): column 0 is k, column 1 is the inertia
In [16]:
# Elbow plot: inertia vs. number of clusters. The "elbow" marks a good k.
plt.plot(d[:, 0], d[:, 1], marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Sum of squared distances (inertia)')
plt.title('Elbow method for selecting k')
plt.show()
Out[16]:
As the plot above shows, there is an obvious turning point ("elbow") at k = 4, after which the cost levels off into a plateau. This is a simple way of selecting k, commonly known as the elbow method.
In [ ]: